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In brief 

Plassais et al. assemble a catalog of 
thousands of genomes, inclusive of 
ancient and modern canids, in a search 
for genetic variants passed from ancient 
to modern dogs. They identify an ancient 
mutation at the /GF7 locus, which has 
been under human selection, that 
contributes to a significant portion of 
body size in modern dogs. 
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SUMMARY 


Domestic dogs (Canis lupus familiaris) are the most variable-sized mammalian species on Earth, displaying a 
40-fold size difference between breeds.’ Although dogs of variable size are found in the archeological re- 
cord,” the most dramatic shifts in body size are the result of selection over the last two centuries, as dog 
breeders selected and propagated phenotypic extremes within closed breeding populations.° Analyses of 
over 200 domestic breeds have identified approximately 20 body size genes regulating insulin processing, 
fatty acid metabolism, TGFB signaling, and skeletal formation.>"'° Of these, insulin-like growth factor 1 
(IGF1) predominates, controlling approximately 15% of body size variation between breeds.® The identifica- 
tion of a functional mutation associated with IGF1 has thus far proven elusive.°'®"' Here, to identify and 
elucidate the role of an ancestral /GF7 allele in the propagation of modern canids, we analyzed 1,431 genome 
sequences from 13 species, including both ancient and modern canids, thus allowing us to define the evolu- 
tionary history of both ancestral and derived alleles at this locus. We identified a single variant in an antisense 
long non-coding RNA (/GF1-AS) that interacts with the IGF1 gene, creating a duplex. While the derived mu- 
tation predominates in both modern gray wolves and large domestic breeds, the ancestral allele, which pre- 
disposes to small size, was common in small-sized breeds and smaller wild canids. Our analyses demon- 
strate that this major regulator of canid body size nearly vanished in Pleistocene wolves, before its recent 
resurgence resulting from human-imposed selection for small-sized breed dogs. 


gay Current Biology 32, 889-897, February 28, 2022 Published by Elsevier Inc. 889 
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Figure 1. Insulin-like growth factor 1 (IGF1) in Canidae 


(A) IGF1-AS variant genotypes and body mass range collected from 1,162 dogs of 230 breeds. Dots represent outliers. Blue diamonds indicate breed body mass 
averages, boxplots represent interquartile ranges, and black horizontal bars show median for each (**“p < 0.0001, Mann-Whitney-Wilcoxon tests). 
(B) Distribution of IGF1-AS alleles in three schnauzer breeds and poodle varieties. Pie chart indicates population proportion based on genotypes. Red, CC; 


orange, CT; yellow, TT. 


(C) Body mass and serum levels of IGF-1 protein (nmol/L) as functions of GF1-AS genotype. IGF-1 serum protein levels were assayed in 51 dogs, including 13 
mixed-breed dogs (*p < 0.01, **p < 0.001, ***p < 0.0001, Mann-Whitney-Wilcoxon tests). 
(D) Positive correlation observed between body mass and IGF-1 serum level (Rho Spearman test). Blue line shows the regression line; gray area represents 


confidence interval. 
See also Table S1 and Data S1. 


RESULTS AND DISCUSSION 


To identify functional mutation(s) at the /GF7 locus that explain 
body size differences in both modern and historical canids, we 
analyzed 1,431 genomes representing 13 species that encom- 
passed ancient canines, modern breed dogs, and wild canids. 
We generated a catalog of 1,297 modern dog genomes from 
230 breeds (1,156 dogs), 140 indigenous and village dogs from 
around the world, and one dingo (Data S1A), from which we iden- 
tified 64.92 million biallelic variants, including small indels. Using 
data from a maximum of four individuals per modern breed (two 
males and two females), resulting in a total of 456 individuals 
from 179 breeds, we calculated the association with body size 
at the locus surrounding the IGF7 gene on CFA15 (41.20-41.27 
Mb in Canfam 3.1). 

The top 10 most associated variants on CFA15 displayed a high 
degree of linkage disequilibrium (LD), with p values driven largely 
by ten dogs from three non-European breeds: '* four Chow Chows, 
two Afghan hounds, and four Tibetan mastiffs (Table S1; Data 
S1A). Among the 10 most associated variants, we identify a previ- 
ously reported intronic SNP (ts22437444),'°'' as well as a new 
candidate SNP (rs22397284; chr15:41219654.g.T<C, p = 1072). 
Unlike the remaining ten most associated genome-wide associa- 
tion study (GWAS) markers and a previously reported intronic 
SINE element,'°'' SNP 1822397284 is polymorphic in other wild 
canid species (Tables S1 and S2), demonstrating the highest sig- 
nificant association with body mass in wild canids at the /GF7 locus 
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(n = 80, p = 10~'%). Of note, using a set of 19 wild canids, 255 do- 
mestic dogs, and 58 village dogs, we did not detect any copy-num- 
ber variations associated with body size variations on /GF1 locus 
(Data S1B). While there exists the possibility that variants polymor- 
phic only in dogs, suchas the SINE element, play functional roles in 
body size regulation in domestic dogs, we focused our study on 
the new candidate SNP, rs22397284, as it is the only variant we 
identified that is associated with body size variation in both dogs 
and the other canid species analyzed here (Table S2). 

We first observed that 75% (3 quartile) of domestic dogs 
homozygous for the C allele of rs22397284 have a breed 
body mass average (BMA) < 15 kg (herein defined as “small 
breeds”; Figure 1A), while 75% of dogs homozygous for the T 
allele have a BMA > 25 kg (1° quartile; defined as “large 
breeds”; Data S1A). We confirmed these results via Sanger 
sequencing of the SNP in 144 poodle varieties (standard, mini- 
ature, and toy) and in three distinct schnauzer breeds: 48 mini- 
atures, 42 standards, and 48 giants (Data S1C). The schnauzer 
breeds differ in body mass by up to 6-fold between miniature 
and giant yet were likely developed from a close lineage. '*''® 
Miniature and toy poodles and the miniature and standard 
schnauzers are largely homozygous for the small size-associ- 
ated C allele (C allele frequency > 0.95) (Table S1; Data SiC), 
and giant schnauzers are fixed for the large size-associated T 
allele (Figure 1B). Standard poodles, however, possess both al- 
leles in equal frequency, perhaps reflecting a lack of strong se- 
lection for size in the large poodle variety. '* 
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Figure 2. Detection of IGF1-AS variants in ancient and modern genomes 
(A) Map of DNA sampling locations for 35 ancient canids, colored by their genotypes. Circles, dogs; triangles, wolves. Data were merged when several samples 
were collected from the same site with the same predicted age. Number of samples are indicated between brackets. Ages are given in thousand years before 


present (k). 


(B) Genotypes for the IGF1-AS variant in 13 species: 92 whole-genome sequences and 58 DNA samples that were Sanger sequenced for the /GF1-AS variant. 
Map demonstrates a north/south geographic gradient of alleles corresponding to body size. 


See also Figures S1-S3 and Tables S1-S3. 


Next, we Sanger sequenced the candidate variant in 51 dogs, 
including 13 mixed-breed dogs, for which we also measured 
both exact body mass and IGF-1 serum level (Figure 1C). We 
observed significant relationships between genotype, body 
mass, and IGF-1 serum level (p < 0.001, Mann-Whitney- 
Wilcoxon tests) and a direct correlation between body mass 
and IGF-1 serum level (p = 10~°, p = 0.6, Spearman test) (Fig- 
ure 1D). These results confirm that this non-coding variant could 
impact IGF-1 production through a regulatory mechanism, 
particularly since this SNP is located within the last exon of a pre- 
dicted 1,204 bp long non-coding RNA (IncRNA), which is itself an 
antisense of the /GF7 gene (herein referred to as IGF1-AS). 
Because of the strong association between both C and T alleles 
with body mass variation observed across breeds, we termed 
the “small allele” the small body mass-associated C allele and 
the “large allele” the large body mass-associated T allele. 

To better understand the origin of the /GF1-AS variant, we 
extended our analysis to include 33 previously published ancient 
dog genomes*"*:'*~'® (Figure 2A; Table S3). In order to account 
for low coverage and expected DNA damage (i.e., as the IGF7- 
AS variant is a transition) that often characterize ancient ge- 
nomes, we computed the posterior probability of each genotype 
(PP) under different priors, with or without re-scaling base quality 
scores based on the likely damaged positions’? (STAR 


Methods). We found the variant alleles were heterozygous in a 
previously described ~9,500-year-old Siberian sled dog.'° In 
addition, 50% of ancient dog genomes dating from 10,930 to 
100 years before present (yop) were homozygous for the small 
allele (n = 13; PP > 0.9), while 32% were homozygous for the 
large allele (n = 9; PP > 0.9). Surprisingly then, both small and 
large alleles have been segregating in dogs for at least 9,500 
years. 

The body mass of many of the archaeological dogs has been 
estimated,*~* and our characterization of the large and small al- 
leles correlates with the body mass of ancient dogs. We first esti- 
mated a body mass of 24.8 kg for the ~9,500-year-old heterozy- 
gous dog using direct measures of the mandible (Figures S1A 
and S1B; STAR Methods).?°*' We also found that three Israeli 
dogs (~2,300 ybp), estimated to weigh ~14.6 kg (based on the 
methodology described in Harcourt and Wing;* Figure S1C), all 
possessed the small allele (Table S3). A pre-contact American 
dog sample from Newfoundland (~4,000 ybp), described as a 
large dog,° was homozygous for the large allele. In addition, 
our analyses indicated that the frequency of the large allele 
was higher in ancient dogs excavated from northern latitude 
sites (latitude > 55°N; n = 8; freq(T) = 0.75), while ancient dogs 
from southern latitude sites (latitude < 45°N, mostly from the 
Mediterranean region) were more likely to possess the small 
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allele (n = 12; freq(C) = 0.79) (Figure 2A). This fits with what is 
known as Bergmann’s rule, which states that populations and 
species of small size live in comparatively warmer climates while 
larger species and populations tend to live in colder cli- 
mates.°*:*° The antiquity of these alleles and their geographic 
distribution in ancient dogs suggest that each has been segre- 
gating in the ancestors of dogs, which could also explain the 
observed significant associations between these alleles and 
body mass variation observed in other canid species (Table S1). 

To explore this hypothesis, we analyzed genome-wide data 
from nine ancient and 68 modern gray wolves from different lo- 
cations (Figure 2; Tables S2 and S3). We also genotyped 46 addi- 
tional modern gray wolves sampled from nine countries on three 
continents using Sanger sequencing (Table S2). As opposed to 
the previously reported variants in /GF1,'° and as we previously 
described in this work, our analyses indicate that the |GF1-AS 
variant identified here segregates not only in dogs, but also in 
both modern and ancient wolves where it is also associated 
with body size variation (Tables S1-S3). Indeed, we observe 
the small allele, albeit at low frequency, in ancient wolves (n = 
9, freq(C) = 0.16), and we also identify the small allele in a 
53,000-year-old Pleistocene Siberian wolf (heterozygous; 
PP(CT) > 0.9), further demonstrating the antiquity of the small 
allele (Figure 2A; Table S3). 

We next estimated body mass for three ancient wolves (Fig- 
ure S1B). We observe a Pleistocene wolf (16,500 ybp) that is ho- 
mozygous for the large allele with an expected body mass of 


892 Current Biology 32, 889-897, February 28, 2022 


~40,000-15,000 ybp - Domestication 
(oldest genotyped dog: 10,800 ybp) 
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dicted transcript (CFRNASEQ_AS_00037985). The 
C allele, associated with small sizes in canids and 
shared by the four species, corresponds to the 
ancestral allele. 

(B) Canidae ancestor was likely small and carried 
the C allele. The large allele arose some time 
before 53,000 years before present (ybp) and 
generated bigger animals (Canis /upus). The 
ancestral small allele continues to exist in the gray 
wolf population, albeit at a low frequency. 
Approximately 15,000 ybp, canine domestication 
likely began with large wolf-like dogs.” Shortly 
thereafter, human selection of small canids with 
the ancestral C allele led to a preponderance of 
small modern domestic breeds. Gray arrow re- 
flects actual hybridization observed between 
coyotes and wolves in eastern part of America. 
See also Figures S2 and S3. 


~39.6 kg, while ancient heterozygous 
wolves (52,500 and 16,900 ybp) have an 
expected mass of 21.8 and 38.1 kg, 
similar to modern heterozygous canids 
(Figure S1C). In addition, we used generalized linear models 
(GLMs) to test if associations exist between the distribution of 
the small allele, latitude, and temperature (Figures 2B). We 
observe that the frequency of the small allele is higher in modern 
smaller-size wolves (~25 kg)*“’° from lower latitudes (e.g., Mid- 
dle East, n = 11, freq = 0.47; Asia, n = 28, freq = 0.2) than in 
comparatively larger wolves (~40 kg)°° from higher latitudes 
(e.g., North America, n = 34, freq = 0.09; Europe, n = 28, freq = 
0.11; Siberia, n = 3, freq = 0; binomial GLM, latitude: AAIC = 
12.74, p < 0.005, Tjur’s R? = 0.23, temperature: AAIC = 11.76, 
p < 0.0005, Tjur’s R? = 0.34), which matches with the predictions 
made by Bergman’s rule. 

Although the large allele is more frequent in both modern and 
ancient gray wolves, the antiquity of the small allele makes it diffi- 
cult to determine which allele is ancestral. To address this, we 
analyzed 24 additional genomes from 11 distantly related canid 
species including four coyotes, two red wolves, five African 
golden wolves, one Ethiopian wolf, three African hunting dogs, 
three golden jackals, one black-backed jackal, one side-striped 
jackal, two dholes, one gray fox, and one Andean fox (Figure 2B; 
Table $2), representing a body mass range of 5 to 35 kg.”° We 
found strong statistical support for the relationship between lati- 
tude, temperature, and the distribution of the small allele in all 
canid populations tested (binomial GLMM: AAIC [null models] = 
16.09 and 15.33 for latitude and temperature, respectively). 
Small allele frequency in canid populations decreases with 
latitude and increases with temperatures (p < 0.0005, Tjur’s R® 
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Figure 4. Relationship between IGF1-AS variant genotype and individual body mass measures in coyotes 
(A) Mean body mass reported by U.S. state for 79 coyotes sampled by universities and museums, as indicated in STAR Methods. Circle size indicates number of 


individuals. 


(B) Frequency of the C allele of IGF1-AS variant in 76 samples (distinct from those in A; STAR Methods) drawn from eight states across the U.S. Both maps 
illustrate the west-to-east gradient for the coyote population supported by statistical models. 

(C and D) Linear (blue) and quadratic (red) relationships between longitude and body mass (C) or small allele frequency (D). Lines indicate predicted values from 
generalized linear models (with binomial error for small allele frequency and Gaussian error for body mass). In both cases, quadratic and linear effects received 
similar statistical support. West coast coyotes are primarily homozygous (C allele freq = 0.93, mean body mass = 9.18 kg + 2 SD). East coast coyotes carry all 
three genotypes (mean body mass = 16.03 kg + 3 SD). Mid-U.S. states (Nebraska and Oklahoma) were not included in these estimations. 

(E) Analysis of 28 coyotes from Pennsylvania demonstrates a significant relationship between /GF1-AS allele status and body mass (*p < 0.01, ***p < 0.0001, 
Mann-Whitney-Wilcoxon tests), but excludes the hypothesis of a local geographic effect on their distributions (at the state scale). 


See also Tables S1 and S2. 


= 0.64; STAR Methods). Interestingly, except for the two red 
wolves from North America (20-35 kg), all canids (coyotes, 
jackals, African wolves and hunting dogs, dholes, and foxes) 
possessed the small allele in a homozygous state, suggesting 
that the small allele is the ancestral state. 

We next performed a comparative genomic analysis that 
shows that 60%-70% of the genomic DNA defined by /GF1- 
AS exons in dogs is conserved among the most closely related 
mammals, as identified on an/GF7 maximum likelihood phyloge- 
netic tree generated by the Ensembl database (http://www. 
ensembl.org/) (Figure 3A; STAR Methods). Specifically, the small 
allele (C) was present in ferret, panda, and cat, supporting the hy- 
pothesis that the small allele represents the ancestral state. 

To further explore this observation, we genotyped the /GF1- 
AS loci using Sanger sequencing in 10 Channel Island and two 
gray foxes (Table S2), both species weighing 1.4-5.5 kg.°° We 
observe that all are homozygous for the small allele (freq = 1, 
n = 12). As in gray wolves, the previously described body size- 
associated SINE and SNP (rs22437444) originally identified in 
dogs'° do not show any association with size in small wild canids 
(Table S1). Conversely, for the /GF1-AS variant, all small canids 
(including small gray wolves) living in warmer regions carried the 
small allele, which suggests that /IGF1-AS may be a major 
contributor to body size variation in canids other than dogs. 


Finally, we obtained body mass data from 79 adult coyotes 
sampled from across North America (Table S2). We observed 
that coyote body size is variable, following a west-to-east 
gradient of small (west coast mean body mass = 9.18 + 2 kg) 
to large (east coast mean body mass 16.03 + 3 kg), as previously 
reported?’ (Figures 4A and 4C). 

Sanger sequencing of the IGF7-AS variant in 76 coyotes from 
locations spanning the U.S. revealed a high frequency of the small 
allele (freq = 0.93; n = 21) in west coast coyotes and a significantly 
lower frequency (freq = 0.47; n = 44; binomial GLM, p < 0.001; 
Tjurs R® = 0.21) among coyotes from the east coast 
(Figures 4B and 4D), where hybridization with wolves was recently 
described,”’ suggesting that the large allele was recently intro- 
gressed from wolves into coyotes (Figure 3B). Finally, we geno- 
typed 28 distinct coyotes from Pennsylvania state for which indi- 
vidual body mass data are available, demonstrating a significant 
relationship between /GF1-AS allele state and individual body 
mass (Mann-Whitney-Wilcoxon tests: p < 0.0001) (Figure 4E). 
These data demonstrate a strong association between /GF1-AS 
allele state and body size gradient in coyote populations across 
the U.S. Thus, at the /GF7 locus, the /IGF1-AS variant is likely 
the main canid body size mutation, such that small canids are ho- 
mozygous for the small allele and large gray wolves are homozy- 
gous for the large allele, as are large dogs. 
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Because of its variability among canid species and its low fre- 
quency in gray wolves, it is possible that the small allele was in- 
trogressed into dogs via gene flow from a small canid population. 
Using D-statistics,°° however, we found no evidence of excess 
allele sharing between small dogs and comparatively smaller 
wild canids (Figure S2). This supports recent findings suggesting 
that gene flow from wild canids is not a significant feature of the 
more recent evolutionary history of dogs.'* We did not detect 
any introgression from small wild canids (golden jackal, dhole, 
and African golden wolf) into gray wolves living in warmer tem- 
peratures (Figure S2), according to a previous report, '* which 
may indicate that hybridization between wolves and other small 
canids, except coyotes, remains rare outside of North America. 
Lastly, our analyses demonstrate that Middle Eastern gray 
wolves (Iran, Israel, Saudi Arabia, and Syria, n = 4) share a closely 
related haplotype containing the /GF1-AS small allele with small 
domestic dog breeds (Figure S3), thus supporting the idea of a 
common origin for the variant observed in small dogs and wild 
canids."' 

Finally, we validated the existence and the structure of the 
IGF1-AS |IncRNA using RNA sequencing (Figures 3A and S4; 
Data S1F). We confirmed that the last exon of /GF1-AS overlaps 
the third coding exon of IGF7 (Figure S4A) and that the candidate 
SNP is thus located 200 bp downstream of the last common 
nucleotide shared between IGF7 and IGF1-AS. We also showed 
that IGF1-AS interacts with IGF1 mRNA, creating a 182-bp 
InCRNA/IGF1 mRNA duplex (Figure S4B), and we do not detect 
differential expression between small and large dogs for either 
the IncRNA or /GF7 mRNA (Figure S4C). Knowing that antisense 
overlapping IncRNAs can regulate mRNA translation rate with no 
effect on MRNA levels,”°"*" it is possible that IGF1-AS could act 
as a regulatory mechanism associated with IGF-1 production, 
perhaps by affecting the affinity of a ribosomal binding motif 
for the C versus T allele. 

Altogether, our results indicate that the selection for small 
dogs targeted an ancestral allele at a SNP in an IncRNA that is 
antisense to the established body size gene /GF1.°'°* Our ana- 
lyses reveal that the large allele (T) likely arose in wolves more 
than 53,000 years ago*® (Figure 3B). The frequency of this allele 
then increased, likely due to natural selection in gray wolves dur- 
ing the Pleistocene, perhaps due to lower temperatures, and 
became fixed in northern latitude wolves, while the small allele 
persisted in wolves from southern latitudes. 

The latitudinal association of the two alleles also exists in 
ancient dogs from northern and southern Eurasia, suggesting 
that dogs may have either been under similar body size selective 
pressures or experienced gene flow with local wolves. The avail- 
ability of both the small- and large-sized associated alleles within 
the global dog population has also allowed dog breeders, begin- 
ning in the 19th century, '* to take advantage of the morpholog- 
ical plasticity conferred by these alleles to produce breed dogs of 
highly divergent sizes. Selection has also allowed for the near fix- 
ation of these alleles in modern large and small breeds. 

The human penchant for novelty has meant that domestic an- 
imal populations often possess phenotypic variability that may 
not have existed within the more homogeneous wild progeni- 
tors.°*°° Often, these novel characteristics only appeared after 
domestic animals became acclimated to the human niche and 
experienced a commensurate reduction in natural selection. 
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Thus, many targets of human selection have been driven to fre- 
quencies that would have been actively selected against in set- 
tings where humans had less influence over any individual’s sur- 
vival. The evidence presented here demonstrates that humans 
have also targeted standing variation present within wild ances- 
tors. In the case of dogs, the 40-fold size divergence has been 
driven in large part by selection on two divergent alleles in 
IGF1-AS, both segregating in wolves for over 53,000 years. 
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Materials availability 
This study did not generate new unique reagents. 


Data and code availability 


@ Genomes sequenced for this work, as well as all publicly available data used for alignment, are available via the Short Read 
Archive (https://www.ncbi.nim.nih.gov/sra) and their associated accession numbers are listed in the Data S1A and in the 
key resources table. Newly generated genomes are now available on SRA: PRJNA685036. RNA-Seq raw data are registered 
on SRA: PRJNA690861). The three newly sequenced ancient wolves are registered on The European Nucleotide Archive (ENA) 
under ENA: PRJEB42199. Raw data for tables and figures (including original gel pictures), and all raw data for statistics 
(including GWAS results, GLM, Wilcoxon-Mann-Whitney, Spearman) are publicly available on Dataverse. The DOI is listed in 
the key resources table and with the corresponding methods. Other data are contained within the article and its supplementary 
information. 

@ This paper does not report original code. 

e@ Any additional information required to reanalyze the data reported in this paper is available from the lead contacts upon request. 


EXPERIMENTAL MODEL AND SUBJECT DETAILS 


We extracted DNA from whole blood samples collected into EDTA or ACD anticoagulant from 331 dogs and 133 wild canids. Three 
ancient DNA were extracted from tooth or bone samples in a dedicated ancient DNA laboratory using appropriate sterile techniques 
and equipment. We extracted RNA from 42 testes collected by registered veterinarians during routine sterilization procedures with 
consent from the dog owner. All procedures were reviewed and approved by the Animal Care and Use Committee of the National 
Human Genome Reseach Institute (NHGRI) at the National Institutes of Health. We provide a full description of the specimens in 
the Methods Details. 


METHOD DETAILS 


Modern canids whole genome sequencing datasets 

WGS data was gathered from the Sequence Read Archive (SRA; http://www.ncbi.nim.nih.gov/sra), or Genome Sequence Archive 
(https://bigd.big.ac.cn/gsa/) (n = 965 unique individuals)(2),°°° ©? some of which was produced via the Dog10k project®® (n = 
371), or newly generated by the NIH Intramural Sequencing Center (n = 97) and are now available on NCBI: accession number 
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PRJNA685036. All Biosample numbers for the initial 1,297 modern dog genomes as well as coverage levels are listed in the Data S1A. 
To create the vef file containing the 1,297 modern dogs, sequence reads were aligned to the CanFam 3.1 reference genome (hitp:// 
genome.ucsc.edu/cgi-bin/hgGateway?db=canFam3) using the BWA-MEM algorithm’ (current version BWA 0.7.17) and sorted with 
SAMtools (current version SAMtools 1.6).°° 

For non-PCR-free libraries, PCR duplicates were marked as secondary reads using PicardTools (http://github.com/broadinstitute/ 
picard; current version PicardTools 2.9.2). GATK*”"“° (version GATK 4.1.4.0) was used to perform base recalibration using 2,738,537 
dbSNP v131 variants. SNVs and small indels were called using GATK HaplotypeCaller, which first calls variants per-individual in 
gVCF mode with subsequent joint-calling utilizing all individuals.°* Variant quality score recalibration was conducted with GATK 
best practices and default parameters for SNV and indels separately as follows: SNV recalibration: 172,254 Illumina Canine HD 
Chip variants (training, true, prior = 15); 2,738,537 dbSNP v131 variants (known, training, prior = 6); 3,627,539 published variants 
from Axelsson et al.°° (known, prior = 6). Indel recalibration: 714,278 variants as known, training and truth sets with a prior of six°° 
and maxGaussians set to 4. After alignment and variant calling, samples were removed if they were low quality, e.g., less than 2x 
average depth. 

The final datasets consisted of one VCF file of 75.6 million variants and contained 1,156 modern dog genomes encompassing 230 
breeds, 140 indigenous and village dogs (including 15 New Guinea singing dogs) and one dingo, sampled from around the world 
(Data S1A). We built a second VCF file containing only wild canid genomes (n = 86) obtained using SRA data from published pa- 
pers.®:°° © This file contains 68 gray wolves, four coyotes, two red wolves, five African golden wolves, one Ethiopian wolf, three Af- 
rican hunting dog, three golden jackals, one black-backed jackal, one side-striped jackal, two dholes, one gray fox and one Andean 
fox (Table S2), representing a weight range of five to 60 kg.”° In order to check IGF1-AS alleles in wild canids, as well as other pre- 
viously reported body size variants, ‘°° we only retained biallelic variants that were present in the domestic dog VCF file, generating 
a total of 64.9 million variants. Both VCF files are publically available on the NHGRI Dog Genome Project website (https://research. 
nhgri.nih.gov/dog_genome/data_release/index.shtml). 


Additional dog samples, Sanger sequencing and IGF-1 serum level 

Whole blood samples were collected into EDTA or ACD anticoagulant and genomic DNA was extracted using a standard phenol- 
chloroform extraction protocol. All procedures were reviewed and approved by the Animal Care and Use Committee of the National 
Human Genome Reseach Institute (NHGRI) at the National Institutes of Health. We obtained blood samples from 144 poodles (48 
standard, 48 miniature and 48 toy poodle variants) and 136 schnauzers (48 giant, 42 standard, and 48 miniature breeds) (Data 
SiC). The top 10 most associated variants with dog body size, including the /GF1-AS alleles, as well as a previously identified 
SINE element at the /GF1 locus, '° were validated using Sanger sequencing and agarose gel migration (1%), respectively. Primers 
were designed using Primer3plus®’ and are listed in the key resources table. Targeted regions were amplified using polymerase chain 
reaction (PCR) with KOD Xtreme HotStart Polymerase (Merck). PCR products were purified by ExoSap-ItTM reaction (Thermo Fisher 
Scientific) and sequenced using BigDye Terminator v3.1 (Thermo Fisher Scientific) on an ABI 3730 DNA analyzer (Applied Bio- 
systems). Sequence traces were analyzed using Phred/Phrap/Consed package.** “° Serum levels of IGF-1 in 51 dogs, including 
13 mixed dogs, were measured by ELISA (Veterinary diagnostic laboratory, Michigan State University) following standard methods 
(Data S1D). 


Additional modern canid samples 

We obtained DNA samples from 133 wild canids including 75 coyotes, 10 Channel Island foxes, two gray foxes, and 46 additional 
gray wolves sampled from nine countries on three continents (Table S2). Following the same protocol previously described, the 
IGF1-AS alleles and the nine other most associated variants with dog body size, as well as the previously identified SINE element 
at the IGF1 locus, '° were validated using Sanger sequencing and agarose gel migration (1%) respectively. Additional weight mea- 
sures for 79 adult coyotes were collected directly off websites from the University of Washington Burke Museum (UWBM), Museum of 
Vertebrate Zoology (MVZ), California Academy of Science (CAS), The Museum of Southwestern Biology (MSB), Denver Museum of 
Nature and Science (DMNS), Sam Noble Oklahoma Museum of Natural History (OMNH), Museum of Comparative Zoology (MCZ) and 
Princeton University with permissions (Table S2). 


Ancient DNA data, archaeological samples and context 

We obtained BAM files from 39 published ancient genomes representing six wolves and 33 dogs. '® For each sample we provide 
detailed information (individual ID/ archeological ID, species, age, location, depth, latitude, longitude, the associated reference and 
IGF1-AS genotype) in Table S3. Additional information regarding the archeological records can be found within their original asso- 
ciated papers referenced in Table S3. We completed this dataset with three additional unpublished wolf samples (AL2350, AL3185 
and AL2657), registered on The European Nucleotide Archive (ENA) under the permanent study accession number: PRJEB42199, 
with the following archeological samples and context: 

Botai, Kazakhstan (sample ID: AL2350) 

Botai is an Eneolithic settlement site in Northern Kazakhstan with early evidence of horse husbandry.°° Both dogs and wolves have 
been identified from the site. This wolf specimen was recovered in 2018 from a trash pit, adjacent to a pit house, alongside the bones 


2,3,14- 
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of horse and aurochs. The specimen consisted of a cranium, with attached mandibles, and the first three cervical vertebrae. The ante- 
rior portion of the snout had been removed from the middle of the tooth row, and the cranium displayed damage from putative pro- 
jectile injury. It is radiocarbon dated to 5,169 ybp. 

Pietrele, Giurgiu province, Romania (sample ID: AL3185) 

The site is ac. 9m high Chalcolithic tell-settlement situated close to the Danube. The occupation phases of the tell date to the period 
4,450 - 4,250 calBC.®° The wolf bone comes from the uppermost layers. It is radiocarbon dated to 6,307 ybp. 

Eliseevichi, W Russia (sample ID: AL2657) 

The Epigravettian Eliseevichi site is located in the Russian Plain on the right bank of the Sudost’ River, a tributary of the Desna. Based 
on two AMS dates on canid material’°’”' the calibrated age of the site is about 16,500 ybp. The faunal assemblage is dominated by 
woolly mammoth, reindeer, large canids and polar fox.’ Remains of eight complexes made from mammoth skulls and bones were 
recovered.’*:’ Two large canids skulls and one mandible are described as from Palaeolithic dogs.’”’° The analyzed mandible 
(AL3185 - 23781 (3)) is a complete left jaw, with a broken first molar, from an adult canid (Figure S1A). 


Ancient DNA sequencing 

DNA was extracted from tooth or bone samples in a dedicated ancient DNA laboratory using appropriate sterile techniques and 
equipment. Extraction was carried out following the Dabney extraction protocol’° but with the addition of a 30 min pre-digest stage. ’” 
Illumina libraries were built following Meyer and Kircher (2010)’° but with the addition of a six base-pair barcode added to the 
IS1_adapter.P5 and IS3_adapter.P5+P7 adaptor pair. The libraries were then amplified on an Applied Biosystems StepOnePlus 
Real-Time PCR system to check that library building was successful, and to determine the minimum number of cycles to use during 
the indexing amplification PCR reaction. A six base-pair barcode was used during the indexing amplification reaction resulting in 
each library being double-barcoded with an “internal adapter” directly adjacent to the ancient DNA strand and forming the first bases 
sequenced, with a traditional external barcode sequenced during Illumina barcode sequencing. We included negative blanks (no 
bone powder and nuclease free water) for every batch, these were them through the entire process assess for contamination. 
The three samples were then sequenced on multiple Hi-Seq 2500/4000 lanes and the paired-end sequencing data were aligned 
to the dog canFam3.1 genome using BWA’’ with permissive parameters including disabled seed’® (-I 16500 -n 0.01 -o 2). Contam- 
ination estimation process is fully-detailed by Bergstrom et al.” 


Validation of IGF1-AS transcripts 
We generated RNA-Seq data from 42 testes derived from various size breeds (Data S1F) that we registered on SRA under the acces- 
sion number PRJNA690861. RNA was extracted from testes using the RecoverAll Total Nucleic Acid Isolation Kit (Thermo Fisher Sci- 
entific) according to the manufacturer’s instructions. Quality score were measured by Agilent Bioanalyzer on a Total RNA 6000 Nano 
chip to obtain RIN score for RNA integrity. Illumina libraries were generated using the TruSeq Stranded mRNA LT-Set A (Illumina Cat 
No RS-122-2101) for the 42 samples, with all having unique barcodes. Library quality control was performed on an Agilent Bio- 
analyzer. Pooled samples were run on the NextSeq 550 (Illumina) using the NextSeq High Output v2.5 75 cycle kit (Illumina Cat 
No 20024906). We then used the nextflow-based RNASeq pipeline from the nf-core community (version 3.1; https://nf-co.re/ 
rnaseq/3.1) in order to uniformly process all testes RNaseq datasets. Briefly, the pipeline included QC of reads using the MultiQC 
tool,°' the mapping of the reads on both the genome (CanFam3.1) and the transcriptome (CanFam3.1-plus)*° with the STAR pro- 
gram°* and the detection of new transcripts with the Stringtie program (option-stringtie_ignore_gtf and merge).°° We also dou- 
ble-checked the strandness of the reads using the RSeQC tool (“‘infer_experiment.py” program)*° which confirmed that the data 
were stranded single-end utlizing a protocol where the reads come from the reverse strand (historically known as -fr-firststrand). 
In order to validate the structural annotation of the IncRNA /GF1-AS, we visualized all BAM files using the Integrative Genomics 
Viewer (I.G.V v2.8.2)°° (Figure S4A). We also manually checked the distribution of reads in the heterozygous samples. No allele spe- 
cific expression were observed between T and C alleles. To complete the RNA-Seq analysis and to confirm the presence of both 
predicted transcripts (CFRNASEQ_AS_ 00037985, CFRNASEQ_AS_00037987), we also performed reverse transcription with 1 jg 
of total RNA from testes using the High-Capacity cDNA Reverse Transcription kit (Thermo Fisher Scientific), according to the man- 
ufacturer’s instructions. We then Sanger sequenced cDNAs from ten dogs (five small and five large) using primer pairs specific for 
each transcript. Primers were designed using Primer3plus°’ and listed in the key resources table. Targeted regions were amplified 
using polymerase chain reaction (PCR) with KOD Xtreme HotStart Polymerase (Merck). PCR products were purified by ExoSap-ItTM 
reaction (Thermo Fisher Scientific) and sequenced using BigDye Terminator v3.1 (Thermo Fisher Scientific) on an ABI 3730 DNA 
analyzer (Applied Biosystems). Sequence traces were analyzed using Phred/Phrap/Consed package.** *° At the end, we confirmed 
that IGF1-AS transcripts contain three exons. The first transcript corresponds to a 1,204 bp RNA(CFRNASEQ_AS_00037985; 
chr15:41,101,001-41,219,825) while the second corresponds to a 1,001 bp RNA (CFRNASEQ_AS_00037987; chr15:41,214,777- 
41,219,825). Both transcripts only differ by their first exon, with different sequences. 


Ribonuclease Protection Assay (RPA) 

Total RNA (approximatively 5 11g) from six dogs (three small and three large) were digested with TURBODNase (Thermo Fisher Sci- 
entific) for 30 min at 16°C and RNase A/T1 Mix (Thermo Fisher Scientific) for one h and 30 min at 16°C to remove all the genomic DNA 
contamination and single-strand RNAs. RNA was purified after each step with the NucleoSpin RNAclean-up kit (Macherey-Nagel). 
The cDNA from endogenous double-stranded RNAs (dsRNA) was produced using the High-Capacity cDNA Reverse Transcription kit 
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(Thermo Fisher Scientific) and the mixture of the two gene-specific primers listed in the key resources table. The double-stranded 
cDNA was amplified in 25 ul PCR reaction system. After 35-cycle amplification, the products were checked by electrophoresis on 
2% agarose gel with SYBR Safe (Thermo Fisher Scientific). Note: Total RNAs used for these experiments were isolated from testes 
under nondenaturating conditions to preserve prospective natural RNA duplex. 


QUANTIFICATION AND STATISTICAL ANALYSIS 


Association analyses 
Only domestic dog samples with > 10x sequence coverage were retained, selecting a maximum of two males and two females per 
breed; those with the deepest coverage were selected when more than three individuals were available. All other samples were 
removed (including wild canids, village dogs, unknown and mixed breed samples), generating a dataset of 456 dogs representing 
179 breeds (Data S1A). For weight and height phenotypes we used the published breed standard, as has been done previously 
(male + female average).’°°° Standard breed weights (SBW) and height (SBH) were obtained from several sources: weights and 
height previously listed in Plassais et al.,° from the American Kennel Club Book of Standards, '* and the Fédération Cynologique In- 
ternationale (http://www.fci.be/en/Nomenclature/). SBW and SBH were applied to all samples from the same breed. For wild canids, 
we determined mean body mass for each species and wolf subspecies using several sources: body mass listed in Padilla & Hilton,°° 
from Lopez,®° Estes et al.,°' and the Wildlife Institute of India.°* We used veftools (-min-alleles 2-max-alleles 2-plink),*' retaining only 
biallelic variants (single nucleotide variants [SNP] and small indels < 200bp), generating a dataset of 64.9 million biallelic variants. 
For domestic dogs, we performed association studies using GEMMA v0.94.1*? as linear-mixed model methods, removing variants 
with missing values > 1%, and correcting each analysis by sex and a previously calculated relatedness matrix. We used the Wald test 
to determine P values and Bonferroni correction was used to identify significant associations (cutoff = —log1o (0.05/number of var- 
iants) = 8.46). For wild canids, we performed association studies using PLINK v1.9*° (using-assoc-adjust-geno 0.05 options). 
Throughout the paper, all violin plot P values were estimated by Mann-—Whitney-Wilcoxon tests (*p < 0.01, “p < 0.001, 
**~ < 0.0001). The relationship between IGF-1 serum level and body mass was tested using a Spearman correlation test (P value = 
2.8e-6, rho = 0.6) and violin plots were constructed in R (httos://www.r-project.org/). In addition, copy number variations were 
analyzed using the same dataset published in Serres Armero et al.°° from which we extracted the /GF7 locus (chr15:40500000- 
41500000). GWAS results for CFA15, CNV analysis, the raw data for figures and all statistics results are publicly available on Data- 
verse (https://doi.org/10.7910/DVN/JBXYZD) (since each GWAS result file contains ~370k makers on CFA15 and sized ~49 Mb, only 
the top 1,000 markers are shown). 


Bergmann’s rule and geographic distribution analyses 

We tested whether the spatial distribution of ‘small allele frequency’ follows the biogeographic pattern and the underlying mechanism 
described in Bergmann’s rule, i.e. number of small animals (and thus here, the frequency of “small allele C”) decreases with latitude 
and temperature.°° To ascertain worldwide map representation when the sampling geographic coordinates were missing, we used 
the calculated centroid position of the associated historical country of wild distribution using maps R package 3.3.0. When coyote 
sampling geographic coordinates were missing we used the calculated centroid position of the associated sampling U.S. state using 
maps R package 3.3.0. Using geographic location of each sample, we extracted local temperature data from CHELSA,” specifically 
the bioclimatic variable bio1 (annual mean temperature) corresponding to the 1979-2013 period. For both latitude and temperature 
that we tested separetely, we used generalized linear mixed models (GLMM; Ime4 R package; version 1.1-25°°) assuming a binomial 
error distribution, with small allele frequency used as the response variable (CC = 1, CT = 0.5, TT = 0) and the latitude (or temperature) 
used as fixed effect (herearfter, M1). We accounted for species-specific intercepts using the species identity as random effect. We 
also accounted for potential non-linear relationship by adding a quadratic term to the model (M2). We compared both models with a 
null model (MO, including the species random effect only) using the Akaike Information Criterion (AIC)°° to select the best model in 
terms of parcimony and data fitting. We assessed the proportion of deviance explained by latitude (or temperature) using the Tjur’s R? 
computed with the ‘performance’ R package version 0.7.2.°” 

We tested whether the spatial distribution of small allele and weight followed a West-East linear gradient across U.S. for coyotes. 
We first used generalized linear models,°° assuming a binomial error distribution with the small allele frequency used as the response 
variable and the longitude used as explanatory variable. We also accounted for potential non-linear relationship by adding a quadratic 
term to the model. We repeated the analysis to test if body mass followed the same West-East gradient using linear models with a 
Gaussian error distribution with body mass used as the response variable. Finally, we extended our analysis to latitude and local 
mean annual temperature for each coyote sample to test (and thus exclude) the Bergmann’s rule hypothesis which could explain 
the geographic pattern observed in coyotes across U.S. Maps and GLM figures were drawn in R (https://www.r-project.org/). All de- 
tails about GLM statistics (models, AIC, parameters, SD, Z score, P values, Tjur’s R2) are publicly available on Dataverse (https://doi. 
org/10.7910/DVN/JBXYZD). 


IGF1-AS genotyping in ancient DNA 

For the 42 samples, raw reads were filtered, allowing one mismatch to the indices used in library preparation. Adaptor sequences 
were removed using AdapterRemoval.*’ Reads were aligned using Burrows-Wheeler Aligner (BWA) version 0.7.5ar405 to can- 
Fam3.1,°’ with default parameters apart from disabling the seed option (-I 1024).°° FilterUniqueSAMCons“? was then used to remove 
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duplicates. BAM files from different sequencing lanes were merged using samtools v 1.3.1.°° Each BAM files was then re-scaled us- 
ing MapDamage v2.'° We then produced two BAM for each sample, each containing only the reads mapping to the IncRNA SNP 
region using samtools: one BAM with and one without rescaling. 

We then calculated likelihood of all ten possible genotypes at the position of interest by first running the following command in 
ANGSD v0.933:°° 

angsd -GL 1 -out < output_file_name > -doCounts 1 -i < input_file_name > -doGlf 4 -nThreads 2 -r chr15:41219654 

For the sake of numerical representation and computational efficiency, angsd reports the likelihood ratio of each genotype 
compared to the “best” one on a logarithmic scale. We thus used a custom script to rescale these values and applied a Bayesian 
framework to obtain genotype posterior probabilities. 

Let Q denote the set of possible genotypes at any given position in a genome. For a diploid individual we have that Q = {AA, AC, AG, 
AT, CC, CG, CT, GG, GT, T T}. We denote G the event that an individual possesses a specific genotype. The events G; constitute a 
partition of the sample space, i.e. 


k=10 
Gi, ..., G10|Gi NG; = @,and U G,=22 
i+j k=1 


Let D denote the data we collected from sequencing, from Bayes Theorem we have: 


= PIES) 


P(G|D) = PID) 


(Equation 1) 


where P(G) is the probability associated with the event Gj i.e., our prior, while P(G,|D) is the probability of the individual having the 
genotype Gj given the data, i.e. our posterior. Because the events G, are exhaustive and mutually exclusive, we can use the law 
of total probability to express the denominator of Equation 1: 


i=10 


P(D) = S°P(D|G)P(G)) (Equation 2) 


i=4 
Thus, we can rewrite our posterior as: 
P(D|G))P(Gy) 
LiP(D|Gi)P(Gi) 
Finally, we can explicitly express the probability of observing the data D given the event G; being true in terms of likelihood: 


P(D|G;) = k L(G|D) (Equation 4) 


P(G|D) = (Equation 3) 


where k is a positive constant which reflects the proportionality relationship between likelihoods and probabilities. We then define a; 
as the value reported by ANGSD for the i-th genotype, with L; its likelinood and Lyes: the likelihood of the best genotype. We can ex- 
press a; as: 


ai = logo zs ) L;=10*Lpest (Equation 5) 
Loest 


Therefore, 
P(D|G;) = k 107 Lpese =K 10% (Equation 6) 


where K is equal to k Lyest- 
By substituting Equation 6 into Equation 3 we obtain the following expression for our posterior probability: 


K 109 P(G)) 10%P(G)) 


P(G|D 
mae) Sony K 10% PG) Soj24°10% PG) 


(Equation 7) 


which can be simplified even further when adopting a uniform prior. 

We used this framework to compute posteriors employing two different priors: a uniform prior (all ten genotypes have the same 
probability value of 0.1) and a more realistic prior, which takes into account our knowledge that this is a biallelic site (P(CC) = 
P(TT) = P(CT) = 0.31 while each of the remaining seven genotypes have a probability of 0.01). At the end, we applied this method 
on our ancient genome dataset, and we determined the genotypes of 35 genomes (26 dogs and nine wolves). All genotype determi- 
nations are publicly available on Dataverse (https://doi.org/10.7910/DVN/JBXYZD). 


Weight estimations of ancient DNA samples 

We estimated weight for four samples: three wolves (AL3185, CGG32, CGG33) and one ancient dog (CGG6). The reference material 
used for the osteometric comparison is composed of four groups. The recent northern wolf reference group (rNw, n = 39) consists of 
mandibles from Palaearctic wolves from locations within Belgium, Sweden and Russia at latitudes above 50°N. The Pleistocene wolf 
reference group (PIW, n = 18) contains mandibles from European and Siberian natural and Palaeolithic sites dating from the pre- and 


e7 Current Biology 32, 889-897.e1—-e9, February 28, 2022 


Current Biology © CelPress 


post-Last Glacial Maximum, located at latitudes above 44°N. The Palaeolithic dog reference set (PalD, n = 18) is composed of man- 
dibles from European and Siberian Upper Palaeolithic sites, all located above 44°N. The recent northern dog reference group (rNd, 
n = 89) contains specimens from localities in Siberia, Sakhalin Island, and Greenland at latitudes above 50°N. For more details on the 
reference groups, see Germonpré et al.’°°° 

To assign the Eliseevichi mandible to one of the reference groups, a biplot was used (JMP version 15.0; significance < 0.05, SAS 
Institute) (Figure S1B). For all reference groups, density ellipses (0.95) are given. These ellipses are both density contours and con- 
fidence curves that show where a given percentage (here 95%) of the data is expected to lie; they are computed from the bivariate 
normal distribution fit to the X and Y variables. The variables, expressed in mm, are measured on the mandibles as proposed by von 
den Driesch.”° Following measurements are utilized: Total Length (TL): the total length from the condyle process to the Infradentale; 
Hp2p3: the height of the mandible between p2 and p3. 

Body mass estimates (BMe) are calculated based on the regression equations formulated on the base of a combined dataset of 
modern wolves and dogs, all of known body mass at death, by Losey et al.”' (Figure S1B). The equations include the measurements 
as defined by von Den Driesch”° for the skull (Total Skull Length, TL) and for the mandible (length from the condyle process to the 
border of the canine alveolus, LPcC). Finally, we used predicted weight based on the estimators of Harcourt and Wing from 15 Israeli 
ancient dogs published by Stager et al. in 2008.* The three samples (ASHQ01, ASHQ06, ASHQ08), all homozygous for the small allele 
(C), came from the same expedition, the same site, and with the same estimated age. Hence they were likely of the same body size 
(Figure S10; Table S3). Figures were drawn in R (https://www.r-project.org/). 


Comparative approach 

The IGF1 maximum likelihood phylogenetic tree was generated using Ensembl database (http://www.ensembl.org/) and the IGF1 
dog transcript ENSCAFT00000086858 which we compared to genomes of 288 species. Ensembl gene trees are generated by the 
Gene Orthology/Paralogy prediction method pipeline (http://www.ensembl.org/), and then generated by TreeBeST (http://treesoft. 
sourceforge.net/treebest.shtml). Ferret, panda and Carnivores (including cat) are the closest species to canids for which genomes 
are available on UCSC genome browser. We then aligned the two canine /GF1-AS transcripts (CFRNASEQ_AS_00037985, CFRNA- 
SEQ_AS_00037987) on other mammalian genomes using the BLAT tool on the UCSC genome browser (https://genome.ucsc.edu/) 
to identify conserved sequences between species. To draw Figure 3A, we zoomed in on a 40-bp sequence centered on the |GF1-AS 
variant in dogs, and manually identified the conserved nucleotides between mammals. Sequence alignments and their associated 
statistics are publicly available on Dataverse (https://doi.org/10.7910/DVN/JBXYZD). 


Detection of introgression 

To detect potential gene flow on CFA15 that may exist between canid populations,°° we used the D-statistic.2° We used the “ge- 
nomics_general” package (https://github.com/simonhmartin/genomics_general) and first analyzed the 1.9M biallelic variants iden- 
tified on CFA15, and then zoomed in on 2-Mb region (40-42Mb) spanning the /GF7 locus. The D-statistic measures the excess of 
shared-derived sites between a potential introgressor (P3) and a putatively admixed group (P2) over the shared-derived sites be- 
tween P3 and a third group (P1) that is assumed to be unadmixed and sister to the P2 group. For example, if the D-statistic (P1 = 
Grey wolf, P2, P3, Andean fox) deviates positively from 0, the result suggests that more gene flow exists between P2 and P3, while 
a negative value indicates a closer relationship between gray wolves (P1) and (P3). In absence of gene flow, D should be approxi- 
mately zero. We used the Andean fox as an outgroup to define the ancestral alleles.°° For domestic breeds, we defined two groups 
(small and large) keeping the 10 smallest CC dogs (three Chihuahuas, one Pekingese, two Pomeranian, four Yorkshire Terrier) and the 
10 largest TT dogs (two English mastiffs, four Great Danes, one lrishwolfhound, one Komondor, one Leonberger, one Scottish deer- 
hound), thus representing the extreme phenotypes for weight/height (Data S1A). We then calculated the frequency of the derived 
allele in all seven additional species: gray wolf, African golden wolf, red wolf, coyote, golden jackal, small and large dog breeds pop- 
ulations. We evaluated standard errors using a block jackknife approach with a block size of one Mb.°° The D-statistic was calculated 
separately over all combinations of species as P1, P2 and P3. We then split gray wolves by continent and ran the same analysis 
testing for geographic effects (i.e., testing potential gene flow existing between small dogs and Middle East gray wolf, for example). 
In total, we performed all the 2,400 possible comparisons using gray wolf, African golden wolf, red wolf, coyote, golden jackal, small 
and large dog breeds populations. As a note, since the four WGS coyotes used in this analysis were originally sampled from the West 
coast, we could not investigate the hypothesis of a recent introgression with wolves on East coast,”’ and represent it on Figure S2. 
Significant values were estimated following: p = 2*pnorm(-abs (jacknife Z score)). Only results for dogs and small gray wolf popula- 
tions (Middle East, Asia) were drawn using R (https://www.r-project.org/) and are represented on Figure S2. All D-statistics analyses 
are detailed and publicly available on Dataverse (https://doi.org/10.7910/DVN/JBXYZD). 


Haplotype analysis 

We first converted both domestic dog breed and wild canid vet files into plink format using vcftools*' (-plink option) and merged them 
into a single plink file using PLINK v1.9,*° keeping only variants with 90% of their genotypes (-merge-geno 0.9). Data were phased 
and haplotypes determined on CFA15 using the program Beagle v4.1,°° with sliding windows of 1,000 SNPs and a 50-SNP overlap. 
To identify which haplotype contains the |GF1-AS variant, we focused on a 2,682 base pair (bp) region centered on the mutation and 
corresponding to 38 polymorphic markers. We used the Andean fox (Lycalopex culpaeus) to define ancestral alleles°® (n.b: the wild 
canids had to be imputed and there is no reference for any of these species, which can disrupt the phasing process). Using PHASE 
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(v2.1.1),°' we identified a total of 37 haplotypes, 23 containing the IGF1-AS small allele C, and 14 with the T allele (Figure S3; the1,389 
samples are detailed in Data S1E). To perform phylogenetic analysis, we computed a pairwise identity-by-state distance matrix using 
PLINK v1.9*° (-distance 1-ibs option). Bootstrapped distance matrices were created by randomly resampling markers with replace- 
ment 100 times and input into PHYLIP®’ using neighbor and consensus to construct neighbor-joining dendrograms. Andean fox was 
used to root the tree.°° For domestic breeds we used the two groups (small and large) as defined previously (Data S1A). As a note, 
since the four WGS coyotes used in this analysis were originally sampled from the West coast, we could not investigate the hypoth- 
esis of a recent introgression with wolves on East coast,’ and represent it on Figure S3. Dendrograms were visualized using FigTree 
v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). Raw data to construct the tree are publicly available on Dataverse (https://doi.org/ 
10.7910/DVN/JBXYZD). 


RNA-sequencing analysis and qRT-PCR 

FASTQ files were quantified to transcript per million (TPM) expression values using RSEM™ version 1.3 (options: rsem-calculate- 
expression-num-threads 10-paired-end-bowtie2) with CanFam 3.1 as the reference genome for alignment and CanFam 3.1-plus 
used to call gene counts.°° We also ran the same analysis on 51 previously published RNA-seq samples obtained from the Sequence 
Read Archive (https://www.ncbi.nim.nih.gov/sra). To confirm the RNA-Seq results, reverse transcription was performed with 1 ug of 
total RNA from testes using the High-Capacity cDNA Reverse Transcription kit (Applied Biosystems) according to the manufacturer’s 
instructions. We then performed qPCR on diluted cDNA samples (1:20 dilutions from the 1-2 j1g obtained after cDNA reverse tran- 
scription) using the Power SYBR Green PCR Master Mix kit (Applied Biosystems). qPCR reactions were run on the CFX384 TouchTM 
Real-Time PCR Detection System (Bio-rad) using standard procedures. For each sample, we performed three biological replicates 
and the experiment was performed three times. Relative normalized expressions were determined using CFX MaestroTM Analysis 
Software (Bio-Rad). Primers for |GF1, IGF1-AS and GAPDH (reference gene) were designed using Primer3plus®’ and are listed in the 
key resources table. On Figure S4, violin plots were constructed in R (https://www.r-project.org/) and P values were calculated by 
Mann-Whitney-Wilcoxon tests (***p < 0.0001). qRT-PCR raw data are publicly available on Dataverse (https://doi.org/10.7910/ 
DVN/JBXYZD). 
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